Organizations create disaster recovery plans and procedures to protect
against a variety of system failures, but disk failures tend to be the
most common in networking environments. The technology used to create
processor chips and memory chips has improved drastically over the past
couple of decades, minimizing system board failures. Hard drive quality
has also improved drastically over the years, but because hard drives
spin constantly and have the most moving parts in a computer system,
they remain the components most likely to fail.
The key to a fault-tolerant disk solution is creating hardware fault
tolerance on key server drives so that they can be recovered in case of
failure. Information is stored on system, boot, and data volumes, each
with different recovery needs. Many options exist, such as storage area
networks (SANs) or various RAID levels, to minimize the impact of drive
failures.
Note that Exchange Server 2010 environments built with a database
availability group (DAG) architecture are much less affected by
single-server failures. Microsoft suggests doing away with local RAID
configurations and relying on application-layer redundancy to protect
against system failures. In some cases, the savings from using fewer
disks and no expensive RAID controllers will offset the costs
associated with building servers for a redundant site. Take this into
consideration when designing for server resilience.
Hardware-Based RAID Array Failure
Common uses of hardware-based disk arrays for Windows servers include
RAID 1 (mirroring) for the operating system and RAID 5 (striped sets
with parity) for separate data volumes. Some deployments use a single
RAID-5 array for both the OS and data volumes, and RAID 0+1 (mirrored
striped sets) has been used in more recent deployments.
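The trade-offs among these RAID levels can be sketched with some simple arithmetic. The following Python snippet is an illustration only: the function name `raid_summary`, the two-disk assumption for RAID 1, and the disk sizes are hypothetical choices, and real controllers may reserve additional space or support other layouts.

```python
def raid_summary(level: str, disks: int, disk_gb: float) -> dict:
    """Return usable capacity and the number of disk failures the array
    is guaranteed to survive, for the RAID levels discussed above.

    Assumes identical disks and (for RAID 1) a simple two-disk mirror.
    """
    if level == "1":
        if disks != 2:
            raise ValueError("this sketch models RAID 1 as a two-disk mirror")
        usable = disk_gb                    # one disk holds a full copy
    elif level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        usable = (disks - 1) * disk_gb      # one disk's worth of parity
    elif level == "0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, four or more")
        usable = (disks // 2) * disk_gb     # half the disks mirror the other half
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    # Each of these levels is guaranteed to survive one disk failure;
    # a second failure may or may not be survivable depending on the layout.
    return {"usable_gb": usable, "tolerated_failures": 1}


# Example with hypothetical 300 GB disks.
for level, count in (("1", 2), ("5", 4), ("0+1", 4)):
    print(level, raid_summary(level, count, 300))
```

With the same four disks, RAID 5 yields more usable space than RAID 0+1, which is one reason it is commonly chosen for data volumes.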
RAID controllers provide a firmware-based array-management interface,
which can be accessed during system startup. This interface enables
administrators to configure RAID controller options and manage disk
arrays, and it should be used to repair or reconfigure disk arrays if a
problem or disk failure occurs.
Many controllers also offer Windows-based applications that can be used
to create and manage arrays. These applications are available only
after the operating system has started. Follow the manufacturer’s
procedures for replacing a failed disk within a hardware-based RAID
array.
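To see why a RAID-5 array can be repaired after a single disk is replaced, it helps to look at how parity works. The sketch below is a simplified illustration in Python, not a controller implementation: it treats one stripe as three data blocks plus one parity block, and the helper name `xor_blocks` and the sample data are hypothetical. Real controllers rotate parity across all disks and operate on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each on its own disk (sample data only).
data = [b"EXCH", b"2010", b"DAG!"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the second disk: only the other blocks and parity survive.
survivors = [data[0], data[2], parity]

# XOR of the surviving blocks reproduces the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

The same calculation is what a rebuild onto a replacement disk performs for every stripe, which is why array performance typically drops until the rebuild completes.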
Note
Many RAID controllers enable an array to be configured with a hot spare
disk. This disk automatically joins the array when a single disk
failure occurs. If several arrays are created on a single RAID
controller card, hot spare disks can be defined as global so that they
can replace a failed disk on any array. As a best practice, define hot
spare disks for all arrays.
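The difference between a dedicated hot spare and a global hot spare can be modeled in a few lines. The following Python sketch is purely illustrative: the `Array` and `Controller` classes and the rule that a dedicated spare is used before a global one are assumptions made for the example, and actual controller behavior varies by manufacturer.

```python
from dataclasses import dataclass


@dataclass
class Array:
    name: str
    dedicated_spares: int = 0   # spares reserved for this array only
    degraded: bool = False      # True when a failed disk has no replacement


@dataclass
class Controller:
    arrays: dict                # array name -> Array
    global_spares: int = 0      # spares usable by any array on the controller

    def fail_disk(self, array_name: str) -> str:
        """Simulate one disk failure and report which spare took over."""
        array = self.arrays[array_name]
        if array.dedicated_spares > 0:      # assumed order: dedicated first
            array.dedicated_spares -= 1
            return "dedicated"
        if self.global_spares > 0:
            self.global_spares -= 1
            return "global"
        array.degraded = True               # no spare left; manual replacement needed
        return "degraded"


controller = Controller(
    arrays={"os": Array("os", dedicated_spares=1), "data": Array("data")},
    global_spares=1,
)
print(controller.fail_disk("data"))  # uses the global spare
print(controller.fail_disk("data"))  # no spare left for this array
print(controller.fail_disk("os"))    # uses the array's dedicated spare
```

A single global spare protects every array on the card against the first failure, but once it is consumed, the next failure leaves an array degraded until the failed disk is physically replaced.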
System Volume
If a system disk fails, the system can be left in a completely failed
state. To prevent this, the administrator should always try to create
the system disk on a fault-tolerant disk array such as RAID 1 or RAID 5.
If the system disk was mirrored (RAID 1) in a hardware-based array, the
operating system will boot and operate normally because the disk and
partition referenced in the boot configuration remain the same and
remain accessible. If the RAID-1 array was created within the operating
system using Disk Management or diskpart.exe, the mirrored disk can be
accessed at bootup by choosing the second (secondary plex) entry in the
boot menu during startup. Windows Server 2008 keeps these entries in the
Boot Configuration Data (BCD) store, which replaces the boot.ini file
used by earlier versions of Windows. If a disk failure occurs on a
software-based RAID-1 array during regular operation, no system
disruption should be encountered.
Boot Volume
If Windows Server 2008 has been installed on the second or third
partition of a disk drive, separate boot and system partitions will be
created. Most manufacturers require that, for a system to boot from a
volume other than the primary partition, that partition be marked
active. To satisfy this requirement without changing the active
partition, Windows Server 2008 always tries to place the boot files on
the first or active partition during installation, regardless of which
partition or disk will hold the system files. If this drive or volume
fails and the system volume is still intact, a boot disk can be used to
boot into the OS and make the necessary modifications after the drive
is replaced.
Data Volume
A data volume is by far the simplest type of disk to recover. If an
entire disk fails, replace the disk, assign the previously configured
drive letter, and restore the entire drive from backup to recover the
data and permissions.
A few issues to watch out for include the following:
Setting the correct permissions on the root of the drive
Ensuring that file shares still work as desired
Validating that data in the drive does not require a special restore procedure
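Part of this checklist can be automated after a restore. The following Python sketch is a minimal example under stated assumptions: the function name `check_restored_volume` and the folder names passed to it are hypothetical, and the script only confirms that the volume root is readable and that the expected folders exist. It does not inspect NTFS permissions or share definitions, which still need to be reviewed with the appropriate Windows tools.

```python
import os


def check_restored_volume(root, expected_dirs):
    """Return a list of problems found on a restored data volume.

    root          -- path to the restored volume, for example "E:\\"
    expected_dirs -- folders (relative to root) that shares or applications rely on
    """
    if not os.path.isdir(root):
        return [f"volume root not found: {root}"]

    problems = []
    if not os.access(root, os.R_OK):
        problems.append(f"volume root is not readable: {root}")

    for relative in expected_dirs:
        path = os.path.join(root, relative)
        if not os.path.isdir(path):
            problems.append(f"missing folder: {path}")
    return problems
```

For example, calling `check_restored_volume("E:\\", ["Shares", "Archive"])` returns an empty list when both hypothetical folders are present on the restored drive, and a list of messages describing what is missing otherwise.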